Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subvert insert_overwrite merge strategy to bring back merge #371

Closed
wants to merge 1 commit into from

Conversation

elyobo
Copy link

@elyobo elyobo commented Nov 2, 2022

Resolves #16

Description

For comment only to see if there's any interest in the idea, not necessarily this particular implementation.

The insert overwrite implementation uses BigQuery scripting variables to extract store the partition values that will be updated, allowing efficient pruning of the insert query with substantial cost and performance benefits. This works well when those values are small enough to fit inside BQ scripting's memory limit of 1 MiB as described in #16, which I guess is going to be hit at 16,384 INT64 IDs.

Prior to the current insert overwrite strategy, @jtcohen6 worked on a similar approach that used minimum and maximum partition values instead in dbt-labs/dbt-core#1971, which works around the memory limit because only two values will be selected. It's not as efficient (unnecessary partitions between the min and the max will still be read), but the worst case performance is that it's no better than the merge + temp table cost (in which case users can opt to use the merge strategy) and in the best case (a contiguous block of IDs, e.g. in append only data) then it will prune partitions as well as the insert overwrite.

This PR just tinkers with the insert overwrite to bring back this approach as an optional variant, but it's probably confusing UX to have this very-much-not-an-insert-overwrite inside the "insert overwrite" strategy, and it has no tests - I made it to confirm that it was possible and that it worked in practice not just in my head, and it did offer significant improvements for me.

If there's interest in bringing this back I'm happy to work on an improved version that addresses these and any other issues raised.

Alternatives Considered

Users can also implement this themselves (to a large extent, at least; I had SQL header issues) by implementing their own bigquery__get_merge_sql. I added one in #16 (comment) as an example which is working for me although, as mentioned, I had SQL header issues - they're output to build the temp table and I can't seem to suppress them, so they're output again for the real select.

It would be more convenient, simple, and less error prone if this was supported directly.

Users can also fall back to the existing merge approach, but this can be much more expensive.

Checklist

@cla-bot
Copy link

cla-bot bot commented Nov 2, 2022

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin.

CLA has not been signed by users: @elyobo

@elyobo
Copy link
Author

elyobo commented Nov 2, 2022

Happy to sign the CLA if there's interest in getting this merged bot :) I did have a minor PR in dbt-utils, different CLA requirements there?

@elyobo
Copy link
Author

elyobo commented Nov 24, 2022

Would appreciate any feedback on the idea here; it's a valuable cost saver for BigQuery users where the insert/overwrite strategy doesn't work due to memory limits in BQ "scripts" e.g. as described in #16.

@elyobo
Copy link
Author

elyobo commented Feb 9, 2023

Implementation being discussed in #16 but this doesn't look like the way we'll be going so closing.

@elyobo elyobo closed this Feb 9, 2023
@elyobo elyobo deleted the merge-inside-insert-overwrite branch February 9, 2023 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants